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KARMA: Kalman-based autoregressive moving average modeling and 
inference for formant and antiformant tracking ^^ 

Daryush D. Mehta,'') Daniel Rudoy, and Patrick J. Wolfe'=) 

School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138 

(Dated: July 4, 2011) 

Vocal tract resonance characteristics in acoustic speech signals are classically tracked using frame- 
by-frame point estimates of formant frequencies followed by candidate selection and smoothing 
using dynamic programming methods that minimize ad hoc cost functions. The goal of the cur- 
rent work is to provide both point estimates and associated uncertainties of center frequencies and 
bandwidths in a statistically principled state-space framework. Extended Kalman (K) algorithms 
take advantage of a linearized mapping to infer formant and antiformant parameters from frame- 
based estimates of autoregressive moving average (ARMA) cepstral coefficients. Error analysis of 
KARMA, WaveSurfer, and Praat is accomplished in the all-pole case using a manually marked 
formant database and synthesized speech waveforms. KARMA formant tracks exhibit lower over- 
all root-mean-square error relative to the two benchmark algorithms, with third formant tracking 
more challenging. Antiformant tracking performance of KARMA is illustrated using synthesized 
and spoken nasal phonemes. The simultaneous tracking of uncertainty levels enables practition- 
ers to recognize time-varying confidence in parameters of interest and adjust algorithmic settings 
accordingly. 



PACS numbers: 43.72.Ar, 43.70.Bk, 43.60.Cg, 43.60.Uv 



I. INTRODUCTION 

Speech formant tracking has received continued atten- 
tion over the past sixty years to better characterize for- 
mant motion during vowels as well as vowel-consonant 
boundaries. The de facto approach to resonance estima- 
tion involves waveform segmentation and the assumption 
of an all-pole model characterized by second-order digi- 
tal resonators (Schafer and Rabiner, 1970). The center 
frequency and bandwidth of each resonator are then es- 
timated through picking peaks in the all-pole spectrum 
or finding roots of the prediction polynomial. Tracking 
these estimates across frames is typically accomplished 
via dynamic programming methods that minimize cost 
functions to produce smoothly- varying trajectories. 

This general formant-tracking algorithm is imple- 
mented in WaveSurfer (Sjolander and Beskow, 2005) and 
Praat (Boersma and Weenink, 2009), speech analysis 
tools that enjoy widespread use in the speech recognition, 
clinical, and linguistic communities. There are, however, 
numerous shortcomings to this classical approach. For 
example, formant track smoothing and correction (e.g., 
for large frequency jumps) are performed in an ad hoc 
manner that precludes that ability to apply statistical 
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analysis to obtain confidence intervals around the esti- 
mated tracks. 

Initial development of the formant tracking approach 
described here has been reported by Rudoy et al. (2007) 
using a manually marked formant database for error anal- 
ysis (Deng et al., 2006b). The current work continues 
this line of research and offers two main contributions. 
The first provides improvements to the Kalman-based 
autoregressive approach of Deng et al. (2007) and ex- 
tensions to enable antiformant frequency and bandwidth 
tracking in a Kalman-based autoregressive moving aver- 
age (KARMA) framework. The second empirically deter- 
mines the performance of the KARMA approach through 
visual and quantitative error analysis and compares this 
performance with that of WaveSurfer and Praat. 



A. Classical formant tracking algorithms 

Linear predictive coding (LPC) models have been 
shown to efficiently encode source/filter characteristics of 
the acoustic speech signal (Atal and Hanauer, 1971). To 
extract frame- by-frame formant parameters, the poles of 
the LPC spectrum can be computed as the roots of the 
prediction polynomial, peaks in the LPC spectrum, or 
peaks in the second derivative of the frequency spectrum 
(Christensen et al., 1976). The first complete formant 
tracker over multiple continuous speech frames incorpo- 
rated spectral peak-picking, selection of formants from 
the candidate peaks using continuity constraints, and 
voicing detection to handle silent and unvoiced speech 
segments (McCandless, 1974). Extensions to LPC anal- 
ysis incorporate autoregressive moving average (ARMA) 
models that added estimates of candidate zeros associ- 
ated with anti-resonances during consonantal and nasal 
speech sounds (Stciglitz, 1977; Atal and Schroeder, 1978). 
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FIG. 1. Illustration of the (A) classical and (B) proposed approaches to formant tracking. Key advantages to the proposed 
KARMA approach include intra-frame observation of autoregressive moving average (ARMA) parameters for both formant and 
antiformant tracking, inter-frame tracking using linearized Kalman (K) inference, and the availability of both point estimates 
and uncertainties for each trajectory. 



Fig. lA illustrates the classical tracking process for 
the all-pole case. Following pre-processing steps, LPC 
spectral coefficients yield intra-frame point estimates of 
candidate frequency and bandwidth parameters via root 
finding or peak-picking. Intra-frame parameter estima- 
tion can be accomplished using a number of methods 
(Atal and Hanauer, 1971; Atal and Schroeder, 1978; 
Broad and Clermont, 1989; Yegnanarayana, 1978), and 
inter-frame parameter selection and smoothing can be 
performed by minimizing various cost functions in a dy- 
namic programming environment (Sjolandcr and Beskow, 
2005; Boersma and Weenink, 2009). Note that the re- 
quired root-finding (or peak-picking) procedure cannot 
be written in closed form. Consequently, statistical anal- 
ysis (distributions, bias, variance) of the resultant for- 
mant and bandwidth estimates is challenging. Alter- 
native spectrographic representations primarily apply to 
sustained vowels and require significant manual interac- 
tion (Fulop, 2010). 



B. Statistical formant tracking algorithms 

Probabilistic and statistical models for tracking for- 
mants have gained widespread use in the past 15 years 
with motivation from automatic speech recognition ap- 
plications. The first such probabilistic model was intro- 
duced by Kopec (1986), in which a hidden Markov model 
was used to constrain the evolution of vector-quantized 
sets of formant frequencies and bandwidths. Similarly, a 
state-space dynamical model can appropriately constrain 
the evolution of formant parameters, where observations 
of the acoustic speech waveform are linked through non- 
linear relationships to "hidden" states (formant parame- 
ters) that evolve over time. Inference of the state values 
can be performed by variants of Kalman filter algorithms 
(Kalman, 1960). 

In these algorithms, ad hoc assignment of poles and 
zeros to appropriate formant indices is precluded by the 
inherent association of spectral/cepstral coefficients to 
formant and antiformant frequencies and bandwidths. 



The first reported state-space approach to formant track- 
ing inferred formant frequencies and bandwidths directly 
from LPC spectral coefficients (RigoU, 1986). An ex- 
tension to this LPC approach was made by Toyoshima 
et al. (1991) to build a tracker that inferred frequen- 
cies and bandwidths of both formants and antiformants 
from time- varying ARMA spectral coefficients (Miyanaga 
et al., 1986). More recent state-space models define the 
observations as coefficients in the LPC cepstral domain 
(Zheng and Hasegawa- Johnson, 2004; Deng et al., 2007), 
providing statistical methods to support or refute empir- 
ical relations obtained between low-order cepstral coef- 
ficients and formant frequencies (Broad and Clermont, 
1989). 

The proposed KARMA (Kalman-based autoregressive 
moving average) approach explores the performance of 
such a state-space model with ARMA cepstral coeffi- 
cients as observations to track formant and antiformant 
parameters. Taking advantage of a linearized mapping 
between frequency and bandwidth values and cepstral 
coefficients, KARMA applies Kalman inference to yield 
point estimates and uncertainties for the output trajec- 
tories. 



II. METHODS 



Fig. IB illustrates the proposed statistical modeling 
approach to formant and antiformant tracking. This 
approach affords several advantages over classical ap- 
proaches: (1) both formant and antiformant trajecto- 
ries are tracked, (2) both frequency and bandwidth es- 
timates are propagated as distributions instead of point 
estimates to provide for uncertainty quantification, and 
(3) pole/zero assignment to formants/antiformants is 
made through a linearized cepstral mapping instead of 
candidate selection using ad hoc cost functions. 
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A. Step 1: Pre-processing 

The sampled acoustic speech waveform s[m\ is first 
windowed into short-time frames stlm] = s[?n]wi[?n] us- 
ing overlapping windows ^([to] with frame index t. Each 
short-time frame st[m\ is then pre-emphasized via 



st[m] = st[m\ - jst[m - 1], 



(1) 



where 7 is the pre-emphasis coefficient defining the in- 
herent high-pass filter characteristic that is typically ap- 
plied to equalize energy across the speech spectrum for 
improved model fitting. 






if 
if 
if 



n = 1 

1 < n < q 

q < n. 



(5b) 



Derivation of Eqs. (5) is given in the Appendix. The 
proof is derived under the minimum-phase assumption 
that constrains the poles and zeros of the ARMA transfer 
function to lie within the unit circle. 



C. Step 3: Inter-frame parameter tracking 



B. Step 2: Intra- frame observation generation 

1. ARMA model of speech 

Following windowing and pre-emphasis, the acoustic 
waveform Si[?Ti] is modeled as a stochastic ARMA(p,q) 
process: 



st[m\ 



i=\ 



aisAm- 






]\ + u\m\ 



(2) 



where a^ are the -p AR coefficients, hj are the q MA coef- 
ficients, and u\m\ is the stochastic excitation waveform. 
The z-domain transfer function associated with Eq. (2) 
is 



T{z) 



1 + E 



h 7~3 
■ 1 "j"^ 



(3) 



A number of standard spectral estimation techniques can 
be employed in order to fit data to the ARMA(p, q) 
model (see Marelli and Balazs, 2010, for a recent review 
of ARMA estimation methods). In the current study, 
ARMA estimation was performed using the 'armax' func- 
tion in MATLAB's System Identification toolbox (The 
MathWorks, Natick, MA), which implements an iterative 
method to minimize a quadratic error prediction criterion 
(Ljung, 1999, Section 10.2). 

2. Generation of observations: ARMA cepstral coefficients 

In the proposed approach, the ARMA spectral coeffi- 
cients in Eq. (3) are transformed to the complex cepstrum 
before inferring formant characteristics. This mapping 
from ARMA spectral coefficients to ARMA cepstral co- 
efficients has been derived in the all-pole case (e.g., Deng 
et ai, 2006a) and can be extended to account for the 
presence of zeros in the spectrum. Letting C„ denote the 
nth cepstral coefficient. 



L'ri 



(4) 



where C„ depends on separate contributions from the de- 
nominator and numerator of the ARMA model through 
the following recursive relationships: 

Un if n = 1 

an + I]r=i^ (^) a«-jCi if l<n<p (5a) 



The proposed algorithm tracks point estimates and un- 
certainties for / formants and J antiformants from frame 
to frame. To accommodate the temporal dimension, the 
parameters of frame t are placed in column vector Xt : 



Xt 



{h...fi b,...hj fi...fj b[. 



■b'.jf: 



(6) 



where (/i, bi) is the frequency /bandwidth pair of the ith 
formant and (/', 5') is the frequency/bandwidth pair for 
the jth antiformant. 



1. Observation model 

Inference of the output parameters is facilitated by a 
closed-form mapping from the state vector Xt to the ob- 
served cepstral coefficients C„ in Eq. (4). Extending the 
speech production model of Schafer and Rabiner (1970) 
to capture zeros, we assume that the transfer function 
T{z) of the ARMA model can be written as a cascade 
of / second-order digital resonators and J second-order 
digital anti-resonators: 



T(z) 






(7) 



where {ai,ai) and {j3j,Pj) denote complex-conjugate 
pole and zero pairs, respectively. Each pole and zero are 
parameterized by a center frequency and 3-dB bandwidth 
(both in units of Hertz) using the following relations: 



{ai,ai) = exp 



iPj.lij) =exp 



-TTfe, ± 27ry^/, ^ 

fs ) 

-Tib'j ± 27rV^/j ' 

Is 



(8a) 
(8b) 



where fa is the sampling rate (in Hz). 

Performing a Taylor-series expansion of logT(z) yields 



logTW^tS^^^^'-"- 



= 1 n=l 



J °° '^"+^j 






(9) 
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Recalling that C„ is the nth cepstral coefficient, 
logr(z) — Co + X]^i C'nZ""- Thus, equating the co- 
efficients of powers of z~^ leads to 



TABLE I. Extended Kalman smoother algorithm. 



Cn 



i^K+^")-^E(/^;+/^/ 



i=i 



(10) 



Finally, inserting a^ and /3j from Eqs. (8) into Eq. (10) 
yields the following observation model h{xt) that maps 
elements of Xt to C„: 



h{xt) =C„ 



^cxp 



,7 

E 



exp 



Trn \ /27m 

7i'»^,A /27rn 
-^^, j cos (^ — /, 



(11) 



Pt\t-iHt (HtPtit-iHt +R 



1. Initiahzation; Set rwoio = Mo and P(j|o = So 

2. Filtering: Repeat for i = 1, . . . , T 

mt|t-i = Fmt-i|t-i 

mt|t = rritit-i + Kt{yt - h{mt\t-i)) 
Pt\t = Pt\t-i — KtHtPt\t-i 

3. Smoothing: Repeat for t — T, . . . ,1 

St = Pt-i\t-iF Pj|t_i 
mt_i|T = mt-i\t-i + St {mt\T - Fmt-i\t-i) 

Pt-1\T = -Pt-l|t-l + St {Pt\T — Pt-l\t-l) St 



(16) 



2. State-space model 

We adopt a state-space framework similar to that by 
Deng et al. (2007) to model the evolution of the state 
vector in Eq. (6) from frame t to frame i -|- 1: 



xt+i = Fxt + Wt, 
yt = h{xt) +vt, 



(12a) 
(12b) 



where F is the state transition matrix, and Wt and Vt 
are uncorrelated white Gaussian sequences with covari- 
ance matrices Q and i?, respectively. The function h{xt) 
is the nonlinear mapping of Eq. (11), and vector yt con- 
sists of estimates of the first N cepstral coefficients of C„ 
(not including the zeroth coefficient). The initial state 
Xq follows a normal distribution with mean /xq and co- 
variance So- The state-space model of Eqs. (12) is thus 
parameterized by the set 6: 



0^(F,g,i?,/xo,So) 



3. Linearization via Taylor approximation 



(13) 



The Jacobian matrix Ht thus consists of four sub- 
matrices for each frame t: 



Ht ^ {H{u) H{h) H{j') nm) 



(14) 



where H{fi) and H{bi) each consists of N rows and p/2 
columns: 
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(15a) 



(15b) 



and i?(/') and H(b'A are defined analogously each with 
N rows and q/2 columns. 



The cepstral mapping in Eq. (11) can be lin- 
earized to enable approximate minimum-mcan-square- 
error (MMSE) estimates of the tracked states via the 
extended Kalman filter. The mapping h{xt) is linearized 
by computing the first-order terms of the Taylor-series 
expansion of C„ in Eq. (11): 
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4. Kalman-based inference 

Given observations yt for frame indices 1 to T, the 
extended Kalman smoother (EKS) can be used to com- 
pute the mean mt\T (point estimates) and covariance 
Pt\T (estimate uncertainties) of each parameter in Xt- 
Table I displays the steps of the EKS, which employs a 
two-pass filtering (forward) and smoothing (backward) 
procedure. For real-time processing, the forward filter- 
ing stage may be applied without a backward smoothing 
procedure; naturally, this will lead to larger uncertainties 
in the corresponding parameter estimates. 

Care must be taken when approximating the observa- 
tion model of Eq. (11) to avoid suboptimal performance 
or algorithm divergence in the case of the Kalman fil- 
ter (Julier and Uhlmann, 1997). To verify the appro- 
priateness of the linearization in Section II.C.3 in this 
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FIG. 2. (Color online) Comparison of KARMA (blue) and 
particle filter (red) tracking performance in terms of root- 
mean-square error (RMSE) averaged over 25 Monte Carlo 
trials and reported with 95 % confidence intervals (gray). 



setting, comparisons are made to a more computation- 
ally intensive method of stochastic computation termed 
a particle filter, which approximates the densities in ques- 
tion by sequentially propagating a fixed number of sam- 
ples, or "particles," and hence avoids the linearization of 
Eq. (11). 

Twenty-five Monte Carlo simulations of 100-sample 
data sequences were performed according to Eqs. (12) 
with four complex-conjugate pole pairs {p — 8, q — 0) 
and A^ = 15 cepstral coefficients. Fig. 2 compares the 
output of KARMA using the extended Kalman filter of 
Table I and that of a particle filter in terms of root-mean- 
square error (RMSE) as a function of the number of par- 
ticles, averaged over all formant frequency tracks (true 
bandwidths were provided to both algorithms) . The per- 
formance of the EKF compares favorably to that of the 
particle filter, even when a large number of particles is 
used. Similar results hold over a broad range of parame- 
ter values. 



5. Observability of states 

The model of Eq. (12) does not explicitly take into 
account the existence of speech and non-speech states. 
To continue to track or coast parameters during silence 
frames, the state vector Xt can be augmented with a bi- 
nary indicator variable to specify the presence of speech 
in the frame. The approximate MMSE state estimate 
is then obtained via EKS inference by modifying the 
Kalman gain in Eq. 16: 

Kt - MtPt\t-iH'[ {HtPt\t-iHj + Ry\ 

where Mf is a diagonal matrix with diagonal entries equal 
to 1 or depending on the presence or absence, respec- 
tively, of speech energy in frame t. 

In addition, to handle the presence or absence of 
particular tracks, the state vector x^ can be dynam- 
ically modified to include or omit corresponding fre- 
quency/bandwidth states in Eq. (6). The approximate 
MMSE state estimate is then obtained via EKS inference 
in Table I with the modified state vector. If an absent 



state reappears in a given frame, that state is reinitialized 
with corresponding entries in fig and Sq- 



6. Model order selection and system identification 

As is commonly done, the orders p and q of the ARMA 
model are chosen to capture as much information as pos- 
sible on the peaks and valleys in the resonance spec- 
trum, while avoiding overfitting and mistakenly captur- 
ing source-related information. The ARMA cepstral or- 
der N is chosen to be at least max(p, q) so that all 
pole/zero information is incorporated per Eq. (5). Fi- 
nally, selecting / and J in Eq. (11) depends on the ex- 
pected number of formants and anti-formants, respec- 
tively, in the speech bandwidth /s/2. 

Formants do not evolve independently of one another, 
and their temporal trajectories are not independent in 
frequency. In the synthesis of front vowels, it is common 
practice to employ a linear regression of f^ onto /i and 
/2 (Nearey, 1989, e.g.). Empirically, we found the for- 
mant cross-correlation function to decay slowly (Rudoy 
et al., 2007), implying that a set of formant values at 
frame t might be helpful in predicting values of all for- 
mants at frame t + 1. Thus, instead of setting the state 
transition matrix F to the identity matrix (Deng et al., 
2G06a, 2007), F is estimated a priori for a particular 
utterance from first-pass WaveSurfer formant frequency 
tracks using a linear least-squares estimator (Hamilton, 
1994). 

The state transition covariance matrix Q, which dic- 
tates the frame-to-frame frequency variation, consists of 
a diagonal matrix with values corresponding to standard 
deviations of approximately 320 Hz for center frequencies 
and 100 Hz for bandwidths. These values were empiri- 
cally found to follow temporal variations of speech articu- 
lation. The covariance matrix i?, representing the signal- 
to- noise ratio of the cepstral coefhcients, is a diagonal 
matrix with elements Rnn — ^/n for n g {1, 2, . . . , N}. 
This was observed to be in reasonable agreement with the 
variance of the residual vector of the cepstral coefhcients 
derived from speech waveforms. 

The center frequencies and bandwidths are initialized 
to fj,o = (500 1500 2500 80 120 160) Hz. The initial 
covariance So is set to Q. 



D. Summary of KARMA approach 

Table II outlines the steps of the proposed KARMA 
algorithm, which includes a pre-processing stage, intra- 
frame ARMA cepstral coefficient estimation, and inter- 
frame tracking of formant and antiformant parameters 
using Kalman inference. 



E. Benchmark algorithms 

Performance of KARMA is compared with that of 
Praat (Boersma and Weenink, 2009) and WaveSurfer 
(Sjolander and Beskow, 2005), two software packages 
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TABLE II. Proposed KARMA algorithm for formant and an- 
tiformant tracking. 

Repeat for frames i = 1, . . . , T (Online or batch mode) 

1. Pre-processing of input speech waveform s[m] 

(a) Window: St[m\ = s[m]wt[7n] 

(b) Pre-emphasize St[m] 

2. Intra- frame observation of A'' cepstral coefficients 

(a) Estimate ARMA(p, q) spectral coefficients en and 
bj in Eq. (3) 

(b) Convert Ui and bj to ARMA cepstral coefficients 
using Eq. (4) 

3. Inter- frame parameter tracking of I formants and J an- 
tiformants 

(a) Apply Kalman filtering step in Table I 

(b) rrit^t are point estimates and diagonal elements of 
Pt\t are associated variances of the estimates 

Repeat for frames t = T, . . . ,1 (Batch mode only) 

4. Inter-frame parameter tracking of I formants and J an- 
tiformants 

(a) Apply Kalman smoothing step in Table I 

(b) mt\T are point estimates and diagonal elements 
of PtjT are associated variances of the estimates 



that see wide use among voice and speech researchers. 
WaveSurfer and Praat both follow the classical formant 
tracking approach in which frame-by-frame format fre- 
quency candidates are obtained from the all-pole spec- 
trum and smoothed across the entire speech utterance to 
remove outliers and constrain the trajectories to physi- 
ologically plausibile values. Smoothing is accomplished 
through dynamic programming to minimize the sum of 
the following three cost functions: (1) the deviation be- 
tween the frequency for each formant from baseline val- 
ues of each frequency; (2) a measure of the quality fac- 
tor fi/bi of a formant, where higher quality factors are 
favored; and (3) a transition cost that penalizes large 
frequency jumps. The user sets weights to these cost 
functions to tune the algorithm's performance. 



A. Error analysis using a hand-corrected formant database 

The VTR database contains a representative subset 
of the TIMIT speech corpus (Garofolo et ai, 1993) that 
consists of 516 diverse, phonetically-balanced utterances 
collated across gender, individual speakers, dialects, and 
phonetic contexts. The VTR database contains state in- 
formation for four formant trajectory pairs (center fre- 
quency and bandwidth) . The first three center frequency 
trajectories were manually corrected after an initial au- 
tomated pass (Deng et at, 2004). Corrections were 
made using knowledge-based intervention based on the 
speech waveform, its wideband spectrogram, word- and 
phoneme-level transcriptions, and phonemic boundaries. 

Analysis parameters of KARMA are set to the follow- 
ing values: /s = 7 kHz, 20 nis Hamming windows with 
50 % overlap, 7 = 0.7, p ^ 12 {q ^ 0), and / = 3 
(J =: 0). Each frame is fit with an ARMA(12,0) model 
using the autocorrelation method of linear prediction 
and, subsequently, transformed to A^ = 15 cepstral co- 
efficients via Eq. (5). The initial state vector is set to 

xo = (500 1500 2500)^, and Eq is set to Q. TIMIT 
phone transcriptions are used to indicate whether each 
frame contains speech energy or a silence region. A frame 
is considered silent if all its samples are labeled as a pause 
(pau, epi, /i#), closure interval {bcl, del, gel, pel, tcl, kel), 
or glottal stop (g). Thus, errors due to speech activity 
detection are minimized, and all tracks are coasted dur- 
ing silent frames. 

Default smoothing settings are set within WaveSurfer 
and Praat. Other analysis parameters are matched to 
KARMA: fs ~ 7 kHz, 20 ms Hamming windows with 
50 % overlap, 7 = 0.7, p = 12 (g = 0), and / = 3 
(J^O). 

Table HI summarizes the performance of KARMA, 
WaveSurfer, and Praat on the VTR database. The root- 
mean-square error (RMSE) per formant is computed over 
all speech frames for each utterance and then averaged, 
per formant, across all 516 utterances in the database. 
The cepstral-based KARMA approach results in lower 
overall error compared to the classical algorithms, with 
particular gains for /i and /2 tracking. Praat exhibits 
the lowest average error for /a. 

Figure 3 illustrates the formant tracks output from the 
three algorithms for VTR utterance 200 spoken by an 
adult female. During non-speech regions (the first 700 ms 
exhibits noise energy during inhalation), mean KARMA 



III. RESULTS 

Evaluation of KARMA is accomplished in the all-pole 
case using the vocal tract resonance (VTR) database 
(Deng et at, 2006b). Since the VTR database itself only 
yields estimates of ground truth and exhibits observable 
labeling errors, two speech databases are created using 
overlap-add of synthesis speech frames using the four 
VTR formant tracks. Antiformant tracking performance 
of KARMA is illustrated using synthesized and spoken 
nasal phonemes. 



TABLE III. Formant tracking performance of KARMA, 
WaveSurfer, and Praat in terms of root-mean-square error 
(RMSE) per formant averaged across 516 utterances in the 
VTR database (Deng et ai, 2006b). RMSE is only computed 
over speech-labeled frames. 



Formant 


KARMA 


WaveSurfer 


Praat 


/i 


114 Hz 


170 Hz 


185 Hz 


/2 


226 Hz 


276 Hz 


254 Hz 


/3 


320 Hz 


383 Hz 


303 Hz 


Overall 


220 Hz 


276 Hz 


247 Hz 
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FIG. 3. (Color online) Estimated formant tracks on spectrogram of VTR utterance 200: "Withdraw only as much money as 
you need." Reference trajectories from the VTR database are shown in red along with the formant frequency tracks in blue 
from (A) KARMA, (B) WaveSurfer, and (C) Praat. The KARMA output additionally displays uncertainties (gray shading, ±1 
standard deviation) for each formant trajectory and speech-labeled frames (green). Reported root-mean-square error (RMSE) 
is averaged across formants conditioned on speech presence for each frame. 



track estimates are linear with increasing uncertainty for 
frames that are farther from frames with formant infor- 
mation. Compared to the WaveSurfer and Praat tracks, 
KARMA trajectories are smoother and better behaved, 
reflecting the slow-moving nature of the speech articula- 
tors. The classical algorithms exhibit errant tracking of 
/2 during the /i/ vowel in "need" at 2.5 s that is handled 
by the KARMA approach. 



B. Error analysis using synthesized databases 

Speech waveforms in the first database (VTRsynth) 
are synthesized through overlap-add of frames that each 
foflow the ARMA model of Eq. (2). ARMA spec- 
tral coefficients are derived from the four formant fre- 
quency/bandwidth pairs in the corresponding frame of 
the VTR database utterance using the impulse-invariant 
transformation of a digital resonator (Klatt, 1980). The 
source excitation is white Gaussian noise during non- 
silence frames. Synthesis parameters are set to the fol- 
lowing values: fs = 16 kHz, 20 ms Hanning windows 
with 50 % overlap, and p = 8 {q — 0). 

The second database (VTRsynthfO) introduces a 
model mismatch between synthesis and KARMA anal- 
ysis by applying a Rosenberg C source waveform (Rosen- 
berg, 1971) instead of white noise for each frame consid- 
ered voiced. The fundamental frequency of each voiced 
frame in the original VTR database is estimated by 
WaveSurfer. The VTRsynthfO database thus includes 
voiced, unvoiced, and non-speech frames. Synthesis pa- 
rameters are set as in the VTRsynth database. Formant 



trajectories from these two synthesized databases act as 
truer ground truth contours than in the VTR database 
to test the performance of the cepstral-based KARMA 
algorithm. 

Table IV and Table V display performance on the 
VTRsynth and VTRsynthfO databases, respectively, of 
the three tested algorithms with settings as described in 
the previous section. The proposed KARMA approach 
compares favorably to WaveSurfer and Praat. The sim- 
ilar error of KARMA and WaveSurfer validates the use 
of ARMA cepstral coefficients as observations in place of 
ARMA spectral coefficients. 



C. Antiformant tracking 

The KARMA approach to formant and antiformant 
tracking is illustrated in this section. Synthesized and 
real speech examples are presented to determine the abil- 



TABLE IV. Average RMSE of KARMA, WaveSurfer, and 
Praat formant tracking of the first three formant trajecto- 
ries in the VTRsynth database that resynthesizes utterances 
using a stochastic source and formant tracks from the VTR 
database. Error is only computed over speech-labeled frames. 



Formant 


KARMA 


WaveSurfer 


Praat 


/i 


29 Hz 


37 Hz 


58 Hz 


/2 


53 Hz 


60 Hz 


123 Hz 


fs 


64 Hz 


54 Hz 


130 Hz 


Overall 


48 Hz 


50 Hz 


104 Hz 
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ity of the ARMA-derivcd ccpstral coefficients to capture 
pole and zero information. 



1. Synthesized waveform 

In the synthesized case, a speech-hke waveform /nan/ 
is generated with varying frame-by-frame formant and 
antiformant characteristics and a periodic source excita- 
tion as was implemented for the VTRsynthfO database. 
The /nan/ waveform is synthesized at fs — 10 kHz us- 
ing 75 100 ms frames with 50 % overlap. Formant fre- 
quencies (bandwidths) of the /n/ phonemes are set to 
257 Hz (32 Hz) and 1891 Hz (100 Hz). One antiformant 
is placed at 1223 Hz (bandwidth of 52 Hz) to mimic the 
location of an alveolar nasal antiformant. Formant fre- 
quencies (bandwidths) of /a/ were set to 850 Hz (80 Hz) 
and 1500 Hz (120 Hz). A random term with zero mean 
and standard deviation of 10 Hz was added to each tra- 
jectory to simulate realistic variation. 

Fig. 4 shows the results of formant and antiformant 
tracking using KARMA on the synthesized phoneme 
string /nan/. Two different visualizations are displayed. 
Fig. 4A plots point estimates and uncertanties of the cen- 
ter frequency and bandwidth trajectories for each frame. 
Fig. 4B displays the wideband spectrogram with overlaid 
center frequency tracks whose width reflects the corre- 
sponding 3-dB bandwidth value. Note that the length 
of the state vector in the KARMA's state-space model 
is modifled depending on the presence or absence of an- 
tiformant energy. Estimated trajectories fit the ground 
truth values well once initialized values reach a steady 
state. 



2. Spoken nasals 

During real speech, a vocal tract configuration consist- 
ing of multiple acoustic paths results in the possible ex- 
istence of both poles and zeros in transfer function T{z) 
(Eq. 7). For example, the effects of antircsonances might 
enter the transfer function of nasalized speech sounds as 
zeros in T{z). Typically, the frequency of the lowest zero 
depends on tongue position. For the labial nasal conso- 
nant /m/, the frequency of this antiresonance is approxi- 
mately 1100 Hz. As the point of closure moves toward the 
back of the oral cavity — such as for the alveolar and velar 



TABLE V. Average RMSE of KARMA, WaveSurfer, and 
Praat formant tracking of the first three formant trajectories 
in the VTRsynthfO database that resynthesizes VTR database 
utterances using stochastic and periodic sources. Error is only 
computed over speech-labeled frames. 



Formant 


KARMA 


WaveSurfer 


Praat 


h 


44 Hz 


57 Hz 


57 Hz 


h 


53 Hz 


58 Hz 


117 Hz 


h 


62 Hz 


59 Hz 


111 Hz 


Overall 


53 Hz 


58 Hz 


95 Hz 



nasal consonants — the length of the resonator decreases, 
and the frequency of this zero increases. The frequency 
of a second zero is approximately three times the fre- 
quency of the lowest zero due to the quarter- wavelength 
oral cavity configuration. 

KARMA performance was evaluated visually on spo- 
ken nasal consonants produced with closure at the 
labial (/m/), alveolar (/n/), and velar (/r)/) positions. 
The extended Kalman smoother was applied using an 
ARMA(16,4) model, /s = 8 kHz, 20 ms Hamming win- 
dows with 50 % overlap, A^ = 20 cepstral coefficients, and 
7 = 0.7. The frequencies (bandwidths) of the formants 
were initiahzed to 500 Hz (80 Hz), 1500 Hz (120 Hz), 
and 2500 Hz (160 Hz). The frequencies (bandwidths) of 
the antiformants were initialized to 1000 Hz (80 Hz) and 
2000 Hz (80 Hz). 

Figure 5 displays KARMA outputs (point estimate and 
uncertainty of frequency tracks) and averaged spectra for 
the three sustained consonants. The KARMA algorithm 
takes a few frames to settle to its steady-state estimates. 
As expected, the frequency of the antiformant increases 
as the position of closure moves toward the back of the 
oral cavity. The uncertainty of the first antiformant of 
/r)/ increases significantly, indicating that this antifor- 
mant is not well observed in the waveform. Note that 
the inclusion of zeros greatly improves the ability of the 
ARMA model to fit the underlying waveform spectra. 

Finally, the KARMA tracker was applied to the spo- 
ken word "piano" to determine if the antiformant tracks 
would capture any zeros during the nasal phoneme. Fig- 
ure 6 displays the KARMA formant and antiformant 
tracks with their associated uncertainties. During the 
non-nasalized regions, the uncertainty around the point 
estimates of the antiformant track is large, refiecting the 
lack of antiresonance information. During the /n/ seg- 
ment, the uncertainty of the antiformant tracks decreases 
to reveal observable antiformant information. 



IV. DISCUSSION 

In this article, the task of tracking frequencies and 
bandwidths of formants and antiformants was ap- 
proached from a statistical point of view. The evolution 
of parameters was cast in a state-space model to provide 
access to point estimates and uncertainties of each track. 
The key relationship was a linearized mapping between 
cepstral coefficients and formant and antiformant param- 
eters that allowed for the use of the extended family of 
Kalman inference algorithms. 

The VTR database provides an initial benchmark of 
"ground truth" for the first three formant frequency val- 
ues to which multiple algorithm outputs can be com- 
pared. The values in the VTR database, however, should 
be interpreted with caution because starting values were 
initially obtained via a first-pass automatic algorithm 
Deng et al. (2004). It is unclear how much manual in- 
tervention was required and what types of errors were 
corrected. In particular, VTR tracks do not always over- 
lap high-energy spectral regions. Despite the presence of 
various labeling errors in the VTR database, it is still 
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FIG. 4. (Color online) Illustration of the output from KARMA for the synthesized utterance /nan/. Plots in panel A overlay 
the true trajectories (red) with the mean estimates (blue for formants, green for antiformants) and uncertainties (gray shading) 
for each frequency and bandwidth. Panel B plots an alternative display with a wideband spectrogram along with estimated 
frequency and bandwidth tracks of formants (blue) and antiformants (green). The 3-dB bandwidths dictate the width of the 
corresponding frequency tracks. 



useful to obtain initial performance of formant tracking 
algorithms on real speech. 

In the current framework, cepstral coefficients are de- 
rived from the spectral coefficients of the fitted stochas- 
tic ARMA model (Section II. B). Source information re- 
lated to phonation is thus separated from vocal tract res- 
onances by assuming that the source is a white Gaussian 
noise process. This is not the case in reality, especially 
for voiced speech, where the source excitation component 
has its own characteristics in frequency (e.g., spectral 
slope) and time (e.g., periodicity). This model mismatch 
has been explored here via VTRsynthfO, though we note 
that it is also possible to incorporate more sophisticated 
source modeling through the use flexible basis functions 
such as wavelets (Mehta et at, 2011). 

An alternative approach to ARMA modeling is to com- 
pute the nonparametric (real) cepstrum directly from 
the speech samples. Based on the convolutional model 
of speech, low-quefrency cepstral coefficients are largely 
linked to vocal tract information up to about 5 ms 
(Childers et al, 1977), depending on the fundamental 
frequency. Skipping the 0th cepstral coefficient that 
quantifies the overall spectral level, this would trans- 
late to including up to the first 35 cepstral coefficients 
in the observation vector yt in the state-space framework 
(Eq. 12). Although coefficients from the real cepstrum do 



not strictly adhere to the observation model derived in 
Eq. 11, the approximate separation of source and filter in 
the nonparametric cepstral domain makes this approach 
viable. 

Figure 7 illustrates the output of the algorithm using 
the first 15 coefficients of the real cepstrum as obser- 
vations. Interestingly the performance of the nonpara- 
metric cepstrum is comparably to that of the paramet- 
ric ARMA cepstrum, espeically for the first formant fre- 
quency. Most of the error stems from underestimating 
the second and third formant frequencies. Advantages to 
using the nonparametric cepstrum include computational 
efficiency and freedom from ARMA model constraints. 

The capability of automated methods to track the 
third formant strongly depends on the resampling fre- 
quency, which controls the amount of energy in the spec- 
trum at higher frequencies. For example, if the signal 
were resampled to 10 kHz, a given algorithm might erro- 
neously track the third formany frequency through spec- 
tral regions typically ascribed to the fourth formant. Tra- 
ditional formant tracking algorithms have access to mul- 
tiple candidate frequencies, which are constantly resorted 
so that /4 > /s > /2 > /i- In the proposed statisti- 
cal approach, the ordering of formant indices is inher- 
ent in the mapping of formants to cepstral coefficients 
(11), and further empirical study of this formants-to- 
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FIG. 5. (Color online) KARMA output for three spoken nasals: (A) /m/, (B) /n/, and (C) /i]/. On the left, spectrograms 
overlay the mean estimates (blue for formants, green for antiformants) and uncertainties (gray shading) for each frequency and 
bandwidth. Plots to the right display the corresponding periodogram (gray) and spectral ARM A model fit (black). 
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FIG. 6. (Color online) KARMA formant and antiformant tracks of utterance by adult male: "piano." Displayed are the 
(A) wideband spectrogram of the speech waveform and (B) the spectrogram overlaid with formant frequeny estimates (blue), 
antiformant frequency estimates (green), and uncertainties (±1 standard deviation) for each track (gray). Arrows indicate 
beginning and ending of utterance. Note that the increase in uncertainty during silence regions. 



cepstruni mapping can be expected to lead to improved 
methods when there are additional resonances present in 
the speech bandwidth. Further analysis of "noise" in the 
estimated ARMA cepstrum can also be expected to im- 
prove overall robustness in the presence of various sources 



of uncertainty (Tourncret and Lacaze, 1995). 

Overall, the proposed KARMA approach compares fa- 
vorably with WaveSurfer and Praat in terms of root- 
mean-square error. RMSE, however, is only one selected 
error metric, which must be validated by observing how 
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FIG. 7. (Color online) KARMA formant tracks using observations from (A) parametric ARMA cepstrum and (B) nonparametric 
real cepstrum for VTRsynthfO utterance 1: "Even then, if she took one step forward, he could catch her." Reference trajectories 
from the VTR database are shown in red with the outputs of KARMA in blue. KARMA uncertainties (±1 standard deviation) 
are shown as gray shading. Reported root-mean-square error (RMSE) averages across 3 formants conditioned on the presence 
of speech energy for each frame. 



well raw trajectories behave. The proposed KARMA 
tracker yields smoother outputs as a result and offers pa- 
rameters that allow the user to tune the performance of 
the algorithm in a statistically principled manner. Such 
well behaved trajectories may be particularly desirable 
for the resynthesis of perceptually natural speech. 

Though considerably more complex and more sensi- 
tive to model assumptions, a time-varying autoregres- 
sive moving average (TV- ARMA) model has been pre- 
viously proposed for formant and antiformant tracking 
(Toyoshima et al., 1991) with little follow-up investiga- 
tion. In their study, Toyoshima et al. used an extended 
Kalman filter to solve for ARMA spectral coefficients at 
each speech sample. One real zero and one real pole 
were included to model changes in gross spectral shape 
over time. While a franre-based approach (as taken in 
KARMA) appears to yield more salient parameters at a 
lower computational cost, future work could consider this 
as well as alternative time-varying approaches (Rudoy 
at al, 2011). 

As a final observation, antiformant tracking remains 
a challenging task in speech analysis. Antiresonances 
are typically less strong than their resonant counterparts 
during nasalized phonation, and the estimation of sub- 
glottal resonances continues to rely on empirical relation- 
ships rather than direct acoustic observation (Arsikere 
et al., 2011). Nevertheless, the proposed approach allows 
the user the option of tracking antiformants during select 
speech regions of interest. Potential improvements here 
include the use of formal statistics tests for detecting the 
presence of zeros within a frame prior to tracking them. 



V. CONCLUSIONS 



work are twofold. The first is methodological, with im- 
provements to the Kalman-based AR approach of Deng 
et al. (2007) and extensions to enable antiformant fre- 
quency and bandwidth tracking in a KARMA frame- 
work. The second is empirical, with visual and quan- 
titative error analysis of the KARMA algorithm demon- 
strating improvements over two standard speech process- 
ing tools, WaveSurfer (Sjolander and Beskow, 2005) and 
Praat (Boersma and Weenink, 2009). 

It is expected that additional improvements will come 
with better understanding of precisely how formant infor- 
mation is captured through this class of nonlinear ARMA 
(or nonparametric) cepstral coefficient models. As noted, 
antiformant tracking remains challenging, although it has 
been shown here that appropriate results can be obtained 
for selected cases exhibiting antiresonances. The demon- 
strated effectiveness of this approach, coupled with its 
ability to capture uncertainty in the frequency and band- 
width estimates, yields a statistically principled tool ap- 
propriate for use in clinical and other applications where 
it is desired, for example, to quantitatively assess acous- 
tic features such as nasality, subglottal resonances, and 
coarticulation. 



APPENDIX: DERIVATION OF CEPSTRAL 
COEFFICIENTS FROM THE ARMA SPECTRUM 

Assume an ARMA process with the nrinimum-phase 
rational transfer function 



T(z) A ^ 



ELl 0■^ 



(A.l) 



This article has presented KARMA, a Kalman-based 
autoregressive moving average modeling approach to for- 
mant and antiformant tracking. The contributions of this 



which in turn implies a right-sided complex cepstrum. 
For the moment, assume hj = for 1 < j < q to initially 
derive the all-pole LPC cepstrum whose Z-transform is 
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denoted by C{z): 

oo 

C(z)^logT(z) = ^c„z-", (A.2) 

n=0 

where 

is the nth coefBcient of the LPC ccpstrum. 

Using the chain rule, j^T{z) can be obtained in- 
dependently from Eq. (A.l) or Eq. (A.2), yielding the 
relation 

dC{z) _ 1 dT{z) _ J2^,^iia^z-'+^ 



dz 1 T{z) dz ^ 1 — X]r=i '^i^ 
which implies that 



E^«7^(--") = E 



Rearranging the terms above, we obtain 

oo P P 



nCnZ 



-n+1 



E 



nCnZ 



-n+ 



= y ittiZ ^^ + y aiZ * y nCni 



i=l n=0 



(A.3) 

Using Eq. (A.3), we can match the coefhcients of terms 
on both sides with equal exponents. In the constant- 
coefficient case (associated to z"), we have ci ~ ai. For 
1 < n < p, we obtain 



n— 1 . n— 1 



+ E 



■^'t'-n—i '^~ ^n 



E ' 



i=l i=l 

On the other hand, if n > p, then Eq. (A.3) implies that 



n— 1 . n— 1 / 

En — I Y^ / * 

i—n—p i—1 



In summary, we have obtained the following relationship 
between the prediction polynomial coefficients and the 
complex cepstrum: 

fli if n = 1 

an+Yn=l {'^)^i^n-i ^^ l<n<p 
E7^n-A^)^^Cn-^ if P < n. 

Reversing the roles of i and (n — i) yields the all-pole 
version in Eq. (5a). 

To allow for nonzero bj coefficients in Eq. (A.l), we 
obtain the ARMA cepstral coefficients C„ by separating 
contributions from the numerator and denominator of 
Eq. (A.l) as follows: 

C„ = Z-i log T(z) 

= Z-Mog^^ -Z^Mo: ^ 



A{z) 



Biz) 



Cn C„ , 



yielding the respective pole and zero recursions of 
Eqs. (5). 
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